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Abstract 

Recently, sparsity-based algorithms are proposed for super-resolution spectrum estimation. However, to achieve 
adequately high resolution in real-world signal analysis, the dictionary atoms have to be close to each other in 
frequency, thereby resulting in a coherent design. The popular convex compressed sensing methods break down 
in the presence of high coherence and large noise. We propose a new regularization approach to handle model 
collinearity and to obtain parsimonious frequency selection simultaneously. It takes advantage of the pairing structure 
of sine and cosine atoms in the frequency dictionary. A probabilistic spectrum screening is also developed for fast 
computation in high dimensions. A data-resampling version of high-dimensional Bayesian Information Criterion 
is used to determine the regularization parameters. Extensive experiments show the efficacy and efficiency of the 
proposed algorithms in various challenging situations with small sample size, high frequency resolution, and low 
signal-to-noise ratio. 

Keywords: spectral estimation, sparsity, super-resolution, nonconvex optimization, iterative thresholding, model 
selection, spectra screening. 

I. Introduction 

The problem of spectral estimation studies how signal power is distributed over frequencies, and has 
rich applications in speech coding, radar sonar signal processing and many other areas H). Suppose a 
discrete real signal is observed at finite time points contaminated with i.i.d. Gaussian noise. In common 
with all spectral models, we assume the signal can be represented as a linear combination of sinusoids, 
and aim to recover the spectrum of the signal at a desired resolution. However, the problem becomes very 
challenging when the required frequency resolution is high. In particular, the number of the frequency 
levels at the desired resolution can be (much) greater than the sample size, referred to as super-resolution 
spectral estimation. 

For such discrete signals of finite length, the classical methods based on fourier analysis or least- 
squares periodogram (LSP) 0, JH suffer from power leakage and have very limited spectral resolution . 
Some more recent algorithms, such as Burg [2], MUSIC (multiple signal classification) [5] and RELAX 
(61 only alleviate the issue to some extent. 

We assume that the signal is sparse in the frequency-domain, i.e., the number of its sinusoidal com- 
ponents is small relative to the sample size, referred to as spectral sparsity. It is a realistic assumption 
in many applications (such as astronomy and radar signal processing 0]|), and makes it possible to 
apply the revolutionary compressed sensing (CS) technique. In [7|, Chen and Donoho proposed the basis 
pursuit (BP) to handle overcomplete dictionaries and unevenly sampled signals. A number of similar 
works followed, see, e.g., ll8l- |[T5l and the references therein. 

We point out two crucial facts that cannot be ignored in super-resolution spectrum reconstruction, (a) 
When the desired frequency resolution is superbly high, neighboring dictionary atoms become very similar 
and thus necessarily result in high coherence or collinearity. As is well known in the recent literature, 
the popular convex l x technique as used in BP yields inconsistent frequency selection and suboptimal 
rates in estimation and prediction under such coherent setups lfT6l - ll2"0ll . (b) The grouping structure of the 
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sinusoidal components is an essential feature in spectrum recovery: if frequency / is absent in the signal, 
the coefficients for cos(2%ft) and sin(27r/i) should both be zero. 

In this paper we revisit the problem of super-resolution spectral recovery and propose a group iterative 
spectrum thresholding (GIST) framework to meet the previous challenges. It exploits the pairing structure 
and allows for nonconvex shrinkage estimation. We find that neither the l\ nor the l regularization 
is satisfactory and advocate a hybrid l + l 2 type penalty for spectrum estimation. The computation 
challenge in high dimensions can be handled by a supervised dimension reduction technique. Theoretical 
analysis shows that the new approach can essentially remove the stringent coherence requirement and 
can accommodate much lower SNRs. The rest of this paper is organized as follows. We introduce the 
challenging problem from a statistical point of view and briefly survey the literature in Section HO In 
Section [nil we propose the GIST framework. In more details, we investigate the form of regularization, 
propose an algorithm to fit the group nonconvex penalized model, discuss the selection criterion, and 
design a probabilistic spectral screening for fast computation. Experimental results are shown in Section 
ITVl We summarize the conclusions in Section [V] The technical details are left to the Appendices. 



II. Model Setup and The Super-resolution Challenge 

In this section, we introduce the problem of super-resolution spectrum estimation and review some 
existing methods from a statistical point of view. 

Let y = [y(£ n )]i<n<JV be a real-valued signal contaminated with i.i.d. Gaussian noise iV(0,a 2 ). The 
sampling time sequence {t n }i< n <N is not necessarily uniform (cf. 0). In order to achieve super-resolution 
spectral recovery, an overcomplete frequency dictionary must be applied. Concretely, we use a grid of 
evenly spaced frequencies f k = /max -k/D for k = 0,1, ■ ■ ■ , D to construct the sine and cosine frequency 
predictors, i.e., cos(2ntf k ) and sm(27itf k ). Let T denote the set of nonzero frequencies {fi, ■ ■ ■ , fr>}. The 
upper band limit / max can be (2 min 1 < n < A r(t n — t n _i)) _1 or estimated based on the spectral window ll2lTl . 
The cardinality of the dictionary controls the frequency resolution given by f max /D. The true spectra 
of the signal are assumed to be discrete for convenience, because the quantization error can always be 
reduced by increasing the value of D. The signal can be represented by 
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Vn = y{t n ) = ^2 Ak COS ( 27r fktn + <f>k) + Zr 



n = l,2,...,N, 



(1) 



k=0 



where A k , 4> k are unknown, and the noise {e n }^ =1 are i.i.d. Gaussian with zero mean and unknown 
variance a 2 . Different with traditional spectral analysis, D is allowed to take a much larger value than 
N. This is still a well-defined problem because only a few A k are nonzero under the spectral sparsity 
assumption. 

From A k cos(2tt f k t n + <p k ) = A k cos(<p k ) cos(2tc f k t n ) - A k sin(<p k ) sin(27rf fe t n ) = a k cos(2tc f k t n ) + 
b k sm(2ir f k t n ) with a k = A k cos<f) k , b k = — A k sin<j) k , we introduce 

sin(27rfi/) 



X cos (f) ^ [cos(27rt n /)] 



Kn<N 



cos(2irtif) 
cos(2irt N f) 



, X sh \f) ^ [ S m(27Tt n /)] 1<n 



<N 



sm(2irt N f) _ 



(2) 



and define the predictor matrix 

x ^ [x cos (fi), x cos Ud), X sin (A), • • • , X^Ud)}- (3) 

(Some redundant or useless predictors can be removed in concrete problems, see ©.) Denote the coefficient 
vector by j3 e M? D and the intercept (zero frequency component) by a. Now the model can be formulated 
as a linear regression 



y = a + X{3 + e, 



(4) 
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where (3 is sparse and e ~ N(0,a 2 I). In super-resolution analysis, D ^> N, giving a small-sample- 
size-high-dimensional design. Linear analysis such as Fourier transform fails for such an underdetermined 
system. 

As a demonstration, we consider a noisy 'TwinSine' signal at frequencies 0.25 Hz and 0.252 Hz with 
100 observations. This example is later used in simulation studies; see Section [IV] It has a practical 
background in source separation and target detection. Obviously, the frequency resolution needs to be as 
fine as 0.002 HZ to perceive and distinguish the two sinusoidal components with different coefficients. 
We set / max = 1/2, and thus 2D must be at least 500 - much larger than the sample size. The concrete 
design matrix (without the intercept) is given by 



X 



cos(7r^ti) ■ ■ ■ cos(7r^ti) sin(7r4ii) • ■ ■ sin(7r^7j^ii 



(5) 



The last sine atom disappears because all t n are integers. This yields a super-resolution spectral estimation 
problem. 

There are many algorithms for recovering the spectra of a discrete signal. But not all of them can achieve 
a desired high resolution. From a modeling perspective, we classify them as nonsparse methods and sparse 
methods. Most classical methods (e.g., O, flU, [12111 ') are nonsparse and assume no knowledge on the 
power spectra. For super-resolution spectrum estimation, they may seriously broaden the main lobes and 
introduce side lobes. In this paper, we focus on sparse methods. We briefly survey some relevant literature 
as follows. 

Some related works. As aforementioned, one popular assumption for solving underdetermined systems 
is signal sparsity: the number of present frequency components is small relative to the number of samples. 
The problem is still NP hard because the frequency location of the truly relevant sinusoidal components 
is unknown and the number of candidate components can be very large. (In fact, the frequency grid used 
for constructing the dictionary can be made arbitrarily fine by the customer.) 

Early attempts to enforce sparsity effects include greedy or exhaustive searches ||2"2]|. Il23l and genetic 
algorithms with a sparsity constraint ||24T| . Harikumar [251 computes the maximally sparse solutions under a 
constraint on the fitting error. A breakthrough is due to Chen & Donoho who proposed the basis pursuit 
(BP) for spectrum estimation. A number of similar works followed [T8Tl — [fTTTl . BP is able to superresolve 
for unevenly sampled signals. In our notation, the noiseless version of BP solves the convex optimization 
problem of 

minp||i s.t. a + X0 = y (6) 

(The noisy versions can be defined similarly, in a penalty/constraint form.) The Zi-norm provides the 
tightest convex relaxation to the / -norm and achieves a sparse spectral representation of the signal within 
feasible time and cost. In recent years, the power and limitation of the relaxation have been systematically 
studied in a large body of compressed sensing literature. In short, to guarantee good statistical performance 
in either prediction, estimation, or model selection, the coherence of the system must be low |[T6ll - |fT9ll , 
which implies that the super-resolution challenge cannot be fully addressed by the /x-norm based methods. 
(See the simulation experiments in Section [IV] for its inconsistent frequency selection and poor prediction 
accuracy.) When D is large, many similar sinusoidal components arise in the dictionary and bring in high 
coherence. However, to guarantee Zi's effectiveness in statistical accuracy, frequency selection consistency, 
and algorithmic stability, low coherence is a must — see, e.g, mutual coherence conditions |[T6ll , restricted 
isometry property IfTTl and irrepresentable conditions |[T9l in the theoretical studies. 

To enhance the sparsity of the BP, Blumensath & Davies proposed the iterative hard thresholding (IHT) 
lfT2l . [fT3l . See also Ifl4l . |[T5l for some approximation methods. Intuitively, nonconvex penalties can better 
approximate the / -norm and yield sparser estimates than the convex li -penalty. On the other hand, we 
find that when the signal-to-noise ratio (SNR) is low and/or the coherence is high, the l penalization 
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always gives an over-sparse spectral estimate and misses certain true frequency components. This issue 
will be examined in the next section. 

III. GIST Framework 

This section examines the super-resolution spectrum estimation in details. The complete group iterative 
spectrum thresholding (GIST) framework is introduced at the end. 

A. A novel regularization form 

In this subsection, we study a group penalized least-squares model and investigate the appropriate type 
of regularization. 

The BP finds a solution to an underdetermined linear system with the minimum l\ norm (cf. ©). When 
the signal is corrupted by noise as in (HI), the following /i-penalized linear model is more commonly used: 

i||l/-a-X/3||2 + A||/3||i, (7) 

where A is a regularization parameter to provide a trade-off between the fitting error and solution sparsity. 
The intercept or zero frequency component a is not subject to any penalty. To include more sparsity- 
enforcing penalties, we consider a more general problem in this paper which minimizes 

2D 

-\\y-a- X(3\\l + P(\fo\\ A) := A), (8) 

k=l 

where P(-; A) is a univariate penalty function parameterized by A and is possibly nonconvex. 

Some structural information can be further incorporated in spectrum estimation. From the derivation of 
©, Ak = implies 0k = (3n+k = 0, i.e., the sine and cosine predictors at fk vanish simultaneously. The 
pairing structure shows it is more reasonable to impose group sparsity on {(/3k, 0D+k)}i<k<D rather than 
the unstructured sparsity on {0k}i<k<2D- The group penalized model with the model design © minimizes 

1 D 

-\\y-a- -XT/3H2 + E P (A' + P™* A ) :=F ^ A )- < 9 > 

fc=i 

(In the problem with the design matrix given by ©, the last sine predictor disappears and thus we always 
set (3 2 d t0 be 0.) The penalty function P is the same as before and is allowed to be nonconvex. (For ease 
in computation, the first term in © and © will be replaced by — a — X/3\\%/C for some C large 
enough. See the comment after Theorem [TJ) 

A crucial problem is then to choose the appropriate form of P for regularization purposes. The popular 
/i -penalty P\(t\ A) = \\t\ gives insufficient sparsity and relatively large prediction error, as is shown in the 
previous TwinSine example. There is still much room for improvement in the problem of super-resolution 
spectral estimation. Before we proceed, it is worth pointing out that there are two objectives involved in 
this task 

Objective 1 (Ol): accurate prediction of the signal at any new time point in the time domain; 
Objective 2 (02): parsimonious spectral representation of the signal in the Fourier domain. 
01+02 complies with Occam's razor principle — the simplest way to explain the data is the best. A perfect 
approach must reflect both concerns to produce a stable sparse model with good generalizability. 
From the perspective of 02, the / -norm constructs an ideal penalty 

P (t;A) = yl^ . (10) 
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Yet it is discrete and strongly nonconvex. Interestingly, given any model matrix, the class of penalties 
aP H (t; A/ ' \fa) for any a > 1 fully mimics the behavior of (flOl) , where P H , referred to as the hard-penalty, 
is defined by 



P H (t;X) 



-t 2 /2 + X\t\, if \t\ < A 
A 2 /2, if |t| > A. 



(11) 



Fig. Q] illustrates the penalty family in a neighborhood around 0. The continuous penalty (fTTT) is not an 
approximation but an equivalent variant to the discrete penalty (flOl) in optimization. See [26] . 




-1 -0.8 -0.6 -0.4 -0.2 



Fig. 1. The nonconvex 'hard' penalty family (a > 1) in a neighborhood around 0. All penalties lead to the same estimators in the (group) 
penalized least-squares setup. The discrete lo -penalty Po corresponds to a — oo. The one with the smallest curvature is given by Ph with 
= 1. 



A different type of regularization is desirable seen from Ol. Even if all truly relevant sinusoidal 
components could be successfully located, these atoms are not necessarily far apart in the frequency 
domain, and thus collinearity may occur. In signal processing, Tikhonov regularization is the effective 
means to deal with the singularity issue which seriously affects estimation and prediction accuracy. It is 
in the form of an l 2 -noxm penalty 

P R (t;r ] ) = \vt 2 , (12) 

also known as the ridge penalty in statistics. 

Taking into account both concerns, we advocate the following hybrid hard-ridge (HR) penalty as a 
fusion of CCD and (fl2l) : 

\-\t 2 + A|t|, if|t|<-A_ 

The hard portion induces sparsity for small coefficients, while the ridge portion, representing Tikhonov 
regularization, helps address the coherence of the design and compensates for noise and collinearity. In 
Section IIII-BI and Section IIII-CL we will show that such defined hard-ridge penalty also allows for ease 
in optimization and has better frequency selection performance. 

Finally, we point out a similar regularization, the elastic net ||27l , which adds an additional ridge penalty 
in the lasso problem ©. However, this l± + 1 2 penalty, Ai||/3||i + A| || /3 1| §/2, may over-shrink the model 
(referred to as the double-shrinkage effect ETTl l and can not enforce higher level sparsity than the l\- 
penalty. In contrast, using a g-function trick [|28l . it is shown that Phr results in the same estimator as 
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the H + l 2 ' penalty 



A 2 1 



P(t;\,rj) = -—1 ¥0 + - V t ■ (14) 

The ridge part does not affect the nondifferential behavior of the / - n orm at zero, and there is no double- 
shrinkage effect for nonzero coefficient estimates. 

B. Fitting algorithms 

We discuss how to fit the group penalized model © for a wide class of penalty functions. We assume 
both X and y have been centered so that the intercept term vanishes in the model. 

Our main tool to tackle the computational challenge is the class of O-estimators [26]. Let ©(■; A) be 
an arbitrarily given threshold function which is odd, monotone, and a unbounded shrinkage rule (see E6l 
for the rigorous definition) with A as the parameter. A group 6-estimator is defined to be a solution to 

(3 = Q((3 + X T (y-X{3) ] \). (15) 

Here, for any £ e M. 2D , 0(£; A) is a 2P-dimensional vector £' satisfying 

Wk^'k+ol = [Zk,£k+D]Q(\\[Zk,M\W, A)/||[6,6+d]|| 2 ,i < k < D. 

In the simpler case when no grouping is assumed, the ©-estimator equation ([TBI reduces to 

f3 = Q((3 + X T (y-X(3);X). (16) 

A 6-estimator is necessarily a P-penalized estimator provided that 

r\t\ 

P(t; A) -P(0; A) = / (sup{s : Q(s; A) < u] - u) du + q(t; A) (17) 
Jo 

holds for some nonnegative q(-; A) satisfying q(Q(s; A); A) = for any s6l [|29l . Based on this result, 
we can compute P-penalized estimators by solving ([TBI) for an appropriate 0. 
We propose the following algorithm to fit the penalized model. 

Algorithm 1 GIST-fitting algorithm. 

given X (design matrix, normalized), y (centered), A (regularization parameter(s)), (thresholding 
rule), to (relaxation parameter), and f2 (maximum number of iterations). 

1) X •<— X/tq, y ^— y/r , with r > ||X|| 2 (spectral norm). 

2) Let j <— and f3^ be an initial estimate, say, 0. 
while ||/3^ +1 ' ) — (3^11 is small enough or j > Q do 

3.1) <- (1 - + co(/3 U) + X T {y - X(3^)) for j > 0; <- (3 {j) + X T {y - X/3 U) ) 

for j = 0; 
Group form: 

3.2a) <- V^r 1 ') 2 + (^ } ) 2 , I < k < D 

3.2b) If * 0, /^r 1 ) <- ej? +1) e(tf +1, ;A)/i? +1) and 0$$ <- ^e(/? +1 >; A)/l? +1) - 

Otherwise /3^ +1) = /3^ } = 0. 

Non-Group form: 

3.2') /3 (J+1) <- 0(^' +1) ;A); 
end while 
deliver /3 = (3 {j+1) . 
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We establish the convergence of the algorithm in the following theorem. For simplicity, assume that 
there is no intercept term in the model (which is reasonable when X and y have both been centered), 
and t = 1 > || X || 2 . Let £ = X T X. Construct an energy function for any 7, 3, £ G M 2£) as follows 

Gh, C, A €) = ~!l*7 - y\\t + p(t, a) + |( 7 - /3) r (i - s)( 7 - /3) + ^^(C - r (J - s)- x (c - 
+ + - ^Y 1 x T y -(i- s)-^] T (/ - s)[ 7 + (J - s)- 1 ^ - (/ - s)-^] 

+ ^[C - (I - E)/3 - X r y] T (J - S)- X [C - (I - E)/3 - X^] 

[£ _ (J _ S)/3 - X T #(I - S)- 1 !! - (/ - S)/3 - X T y], (18) 

with the non-group and group versions of P( 7 ; A) being £) fc P(|7fc|; A) and £]f =1 P (JiI + Id+v ^\ 
respectively. G( 7 , C, A is always greater than or equal to the objective function F( 7 ) as defined in Q 
or ([8]). This energy function can be used to prove the convergence of the iterates to a ©-estimator. 

Theorem 1: For any < to < 1 and a thresholding rule 9 satisfying (fTTT ), under the continuity 
assumption 21 in Appendix [A] Algorithm [JJ in either group form or non-group form converges, and 
the iterates (/3 (i) ,£ (j) ) satisfy 

G(3 u+1 \^ j+1 \ /3 (J+1) , A) < G(3^\ /3 {i) , A) — 5i — 5 2 , (19) 

where 5 1 = - £^) T (I - S)- 1 (€ (? ' +1) " and 5 2 = - £)(/3 (j) - + (1 - 

- € ij+ H T {I - - £)(/3 (j) - 3 U+1) ) + (1 - u){£ ij >- Furthermore, any limit 

point 3° of {3^'} is a group (or non-group) 6-estimator that satisfies (fT5T) (or (fT6l)). and the sequence 
G{3 {j \$} j \3 {i \i {j) ] A) decreases to the limit F(3°; A) with F defined in ® (or ®). 

See Appendix IA1 for the proof. Applying the theorem to Algorithm [JJ for an arbitrary X, we know the 
nongroup form solves the optimization problem — X/3||!/to + Sfc=i -^(1^1) A), and the group form 

solves |||y - X3\\1/t* + £? =1 P(^/3 fe 2 + A). 

From the theorem, Algorithm [JJ is justified for computing a penalized spectrum estimate associated 
with P, provided that a proper 9 can be found to satisfy (fT71) . This P-9 strategy covers most commonly 
used penalties, either in group form or non-group form. We give some examples below. 

When is the soft-thresholding, the P-function according to (TT71) is the ^-norm penalty used in BP, 
and the non-group version of our algorithm reduces to the iterative soft thresholding (see, e.g., 11301 ). The 
group li penalty (called the group lasso [13111 ) is more suitable for frequency selection, and can be handled 
by Algorithm [JJ as well. When 6 is the hard-thresholding, for q(-; A) = we get the hard-penalty (fTTT) . 
and for q(t; A) = lo<|t|<A we get the Z -penalty (fTOl) . The algorithm, in non-group form, corresponds 

to the iterative hard thresholding |fl2ll . [TT3l . Finally, if we define 6 to be the hard-ridge thresholding: 

fo, if \t\ < A 

e™(*;A,,)=|^ jf | ( | SA (20) 

then Pe HR is the hard-ridge penalty (fT3l . Setting q(t; A, rj) = -^(|t| — A) 2 l <|t|<A, we successfully reach 
the l + l 2 penalty (fl4j> . 

Algorithm UJ includes a relaxation parameter u, which is an effective means to accelerate the conver- 
gence. See the recent work by Maleki & Donoho 11321 . (Our relaxation form is novel and falls into Type I 
[|33l ). In practice, we set tu = 2, and the number of iterations can be reduced by about 40% in comparison 
to nonrelaxation form. 
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C. Statistical analysis 

Although the l\ regularization is popular (see, e.g., BP 0), in the following we show that the HR penalty 
has better selection power and can remove the stringent coherence assumption and can accommodate lower 
SNRs. We focus on the group form based on the discussion in Section [Till 

Let T be the entire frequency set covered by the dictionary. For the design matrix defined in ©, 
T = • - • , /d}- Given any frequency / G J, we use Xj to denote the submatrix of X formed 
by the sine and cosine frequency atoms at /, and f3f the corresponding coefficient vector. If / C T 
is an index set, Xj and f3 { are defined similarly. In general, Xf is of size N x 2 and (3^2x1 (but 
not always-cf. ©). Given any coefficient vector (3, we introduce z((3) = {/ G T : ||/3/||2 = 0} and 
nz{(3) = {/ G T : ||/3f H2 7^ 0} to characterize the frequency selection outcome. In particular, we write 
z* = z(J3*), nz* = nz{(3*) associated with the true coefficient vector (3*, and let p nz * = \nz*\ be the 
number of frequencies present in the true signal, and p z * = \z*\ the number of irrelevant frequencies. 
Denote by P 1 the probability that with soft-thresholding being applied, there exists at least one estimate (3 
from Algorithm Q] such that nz((3) = nz* . P02 is similarly defined for hard-ridge thresholding. Theorem 
|2] estimates these two probabilities. 

Before proceeding, we introduce two useful quantities k and u. Recall £ = X T X and Tq = ||£|| 2 = 
/i max (S) (the largest eigenvalue of £). Given I c J 7 , let Sj^/ = XjXp and £/ = XjXj. We 



assume the design matrix has been column-normalized such that the 2-norm of every column is yiV. Let 
S (s) = S/iV. Define 



H := /i min (S^l .), and « := max||S^ || 2 /^^, 

where jj, m m denotes the smallest eigenvalue and || ■ || 2 refers to the spectral norm. (5]}„ z * is of size 2 x 2p nz * 
typically.) Intuitively, k measures the 'mean' correlation between the relevant frequency atoms and the 
irrelevant atoms. When k is high, the coherence of the dictionary is necessarily high. 

Theorem 2: Assume \i > 0. 
(i) Let be the sof t-thr esholding. Under the assumption that k < u/p nz and A is chosen such that 
min /6M . \\(3}\\ 2 > 

e f M 2 I? \ 

1-Pl<7 [Pz*-772u+Pnz*-TJ7Z }, (21) 



4 V e M2 / 4 

where M := ^(1 - ^f-) and L := (min /enz « \\(3* f \\ 2 ~ ^ E )^- 

(ii) Let be the hard-ridge thresholding. Assume X,n are chosen such that k < - nt^^J ^ , t '■= 
min /enz . ||[(S nz . + tiI)- 1 ^ f \\ 2 > and n < /jN/t 2 . Then 

e / M' 2 L' 2 \ 

1 " P 02 < ^ [P^^MJyZ+Pnz*^ J , (22) 

where M' := ^(Ar ^ - ^«^||/3^|| 2 ) and V := (, - ^) ^^ . 
The proof is given in Appendix |Bj 

Seen from (|2TI) and (|22l) . both inconsistent detection probabilities are small. It is worth mentioning 
that in practice, we found the value of 77 is usually small, which, however, effectively handles singular- 
ity/collinearity in comparison to n = 0, as supported by the literature (see, e.g., lT34lO . In the following, 
we make a comparison of the assumptions and probability bounds. The setup of p z * ^> N ^> p nz * is of 
particular interest, which means the number of truly present frequencies is small relative to the sample 
size but the number of irrelevant frequencies is overwhelmingly large. 

The ^-conditions characterize coherence accommodation, while the conditions on min/ en2 » H2 and 
1 describe how small the minimum signal strength can be. (i) For the l\ penalty, k < [i/p nz * is a 
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version of the irrepresentable conditions and cannot be relaxed in general ||35l . In contrast, for the l + l 2 , 
the bound for k becomes large when i] is small, and so the stringent coherence requirement can be 
essentially removed! (ii) When 77 is small in the hard-ridge thresholding, the noiseless ridge estimator 
(£ n;Z * + r]I)~ l T lnz * (3* nz , is close to (3* nz *, but the minimum signal strength can be much lower than that 
of the h, due to N^l/tq = /i min (S^» jnz »)/ / u max (S^- ) ) < 1 + rj and in particular, the disappearance of 
y/p nz * . (iii) Finally, for small values of 77, M' > M, L' > L, and so l + 1 2 has a better chance to recover 
the whole spectra correctly. 

With the relevant frequencies located with high probability, introducing an adaptive ridge parameter 77 
is a must to guarantee good estimation and prediction accuracy, especially when the frequency resolution 
is quite high. This facilitates the choice of A because most parameter tuning strategies are prediction 
(generalization) error based. 

D. Model comparison criterion 

This part studies the problem of how to choose a proper regularization parameter A for any given data 
(X, y). Seen from ©, A provides a statistical bias-variance tradeoff in regularizing the model, and ought 
to be tuned in a data-driven manner. In common with most researchers in the compressed sensing area, 
we first specify a grid A = {Ai,---,A;,---,Al}, then run Algorithm \T\ for every A in the grid to get a 
solution path /3(\i), 1 <l < L, and finally, use a model comparison criterion to find the optimal estimate 
/3 opt . The commonly used model comparison criteria are Akaike information criterion (AIC), Bayesian 
information criterion (BIC), and cross-validation (CV). But we found none of them is satisfactory in the 
high-dimensional super-resolution spectral estimation. 

Ideally, in a data-rich situation, one would divide the whole dataset into a training subset denoted by 
(X trn ,y trn ) and a validation subset (X val , y val ). For any A G A, train the model on (X trn ,y trn ) and 
evaluate the prediction accuracy on the validation subset by, say, \\y val — X val /3(\)\\2. However, this 
data- splitting approach is only reasonable when the validation subset is large enough to approximate the 
true prediction error. It cannot be used in our problem due to the small-sample-size issue. 

A popular data-reusing method in small samples is the !?C-fold CV. Divide the dataset into % folds. Let 
(X^\ y^) denote the ^th subset, and (X^~^ , t/ - ^) denote the remaining data. To obtain the CV error 
at any A/ G A, one needs to fit Ki penalized models. Concretely, setting X = X^ 1 ^ and y = as the 

training data, solve the penalized problem associated with A/, the estimate represented by (3 (A;). Then 

calculate the validation error on (X^ ,y^): cv-err(A^, k) = \\y^ — X^j^ ^(A/)|||. The summarized 
CV error, cv-err(A/) = ^2f =i cv-err(A;, k)/N, serves as the comparison criterion. After the optimal A opt 

is determined, we refit the model on the global dataset to get /3 opt . 

However, when a nonconvex penalty is applied, the above plain CV has an inherent drawback: the 
Ki trained models at a common value of A/ may not be comparable, and thus averaging their validation 
errors may make little sense. The reasons are twofold, (a) The regularization parameter A appears in a 
Lagrangian form optimization problem (cf. d8]) or ©). In general, the optimal A to guarantee good selection 
and estimation must be a function of both the true coefficient vector (3* and the data (X, y). The same 
value of A may offer different regularization effects for different training datasets although /3* remains the 
same (consider three datasets as a toy example (i) X = X , y = y Q (ii) X = [Xq, Xq] t , y = [y^, y$ ] T 
and (iii) X = 2X , y = 2y ). In the trainings of ^C-fold CV, (X, y) changes (and so does the empirical 
distribution of X). Fig. |2] shows the numbers of nonzero coefficient estimates under the l penalization 
in 5-fold CV — they are never consistent at any fixed value of A! (b) The solution path /3(A) associated 
with a nonconvex penalty is generally discontinuous in A. Fig. [3] plots the Zo solution path for the default 
TwinSine signal. Even a small change in A may result in a totally different estimate and zero-nonzero 
pattern. In consideration of both (a) and (b), cross-validating A is not a proper tuning strategy in our 
problem. 

To resolve training inconsistency, we advocate a selective cross validation (SCV) as follows. First 
the sparsity algorithm is run on the entire dataset to get a solution path /3(A;), / = 1, • • • , L. Every 
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Fig. 2. The numbers of nonzero coefficients in 5-fold CV with respect to A. The 5 CV trainings yield (sometimes quite) different models 
at the same value of A. 



estimate /3(A;) determines a candidate model with the predictor set given by nzi = nz((3(\i)) = {fk G 
J 7 : pi + /3l +D 7^ 0}. Next, we cross-validate nzi (instead of A) to evaluate the goodness-of-fit of each 
candidate model. In this way, all \% trainings are restricted to the same subset of predictors. Concretely, 

for the / -penalty, (3 *~ (A;) is the unpenalized regression estimate fitted on (y(~®, X^~J^), while for 
the l + ^-penalty, f3 ^(Aj) is the ridge regression estimate fitted on (y(~^\ X^~^) due to Theorem [0 
which is directly given by (3 { ~ l \\i) = {{X^) T X^ + r)I) T {X^) T y^ . The total SCV error is 

summarized by SCV(A Z ) = £?=i \\y^ - X (0 /9 ( "^(A,)||1. 

Motivated by the work of [36], we add a high-dimensional BIC correction term to define the model 
comparison criterion: SCV-BIC(A;) = SCV(Az) + DF(/3(A;)) log N, where DF is the degrees of freedom 
function. When the true signal has a parsimonious representation in the frequency domain, i.e., the number 
of present frequencies is very small, such a correction is necessary — see [36]| for a further theoretical 
justification. For the l or /1 penalty, DF is approximately the number of nonzero components in the 
estimate; for the l + k penalty, DF(/3(A,)) is given by Tr{{X T nz X nZl +r 1 I)- 1 Xl z X nZl ) OH. The final 
optimal estimate f3 opt is chosen from the original solution path {f3(\i)}^ =1 by minimizing SCV-BIC(Az). 

Finally, we point out that in SCV, the sparsity algorithm is only required to run on the global dataset to 
generate one solution path, while CV needs % such solution paths. SCV is more efficient in computation. 
The spectral experiments in Section [IV] show that this tuning strategy is even as effective as large-data 
validation. 
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Fig. 3. The Zo-penalized solution path /3(A) is discontinuous in A. For clarity, only 5 frequency paths (chosen at random) are shown. 

E. Probabilistic Spectra Screening 

Computational complexity is another major challenge in super-resolution studies. In Algorithm CD each 
iteration step involves only matrix- vector multiplications and componentwise thresholding operations. Both 
have low complexity and can be vectorized. The total number of flops is no more than (ADN + 8D)Vl, 
which is linear in D. In our experiments, Vt = 1000 suffices and the practical time cost is only 87.6 
seconds (averaged over fifty runs) on an Intel 2.60GHz CPU. On the other hand, Algorithm Q] needs 
to be run multiple times for parameter tuning (for a grid of A values). With a superbly high resolution 
dictionary, dimension reduction is still desirable to further reduce the computational cost. 

This is indeed possible under the spectral sparsity assumption, where the number of true components 
is supposed to be much smaller than N. One may reduce the dimension from 2D to (say $ = 0.8) 
before running the formal algorithm. If the $N candidate predictors are wisely chosen, the truly relevant 
atoms will be included with high probability and the performance sacrifice in selection/estimation will be 
mild. Hereinafter, we call i3 the candidate ratio. A well designed screening algorithm should not be very 
sensitive to i3 as long as it is reasonably large. The total computational time can be significantly reduced 
after this supervised dimension reduction. 

We propose an iterative probabilistic screening algorithm by adapting Algorithm Q] for screening. This 
has the benefit that the screening principle is consistent with the fitting criterion. We recommend using 
the hard-ridge thresholding and the associated algorithm is stated below. 

In comparison with Algorithm [TJ the difference lies in (3.2b) and (3.2'), where a dynamic threshold 
is constructed in performing the hard-ridge thresholding. We next show that this screening version still 
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Algorithm 2 GIST-Screening algorithm. 

given X (design matrix, normalized), y (centered), 77 (Z 2 shrinkage parameter), (candidate ratio-ratio 
of new dimension to sample size), to (relaxation parameter), and Vt (maximum number of iterations). 
(For simplicity, assume fiN is an integer.) 

1) X 4- X/t , y 4- y/r , with r > ||X|| 2 . 

2) Let j <— and f3^ be an initial estimate say 0. 
while ||/3^ +1 ) — pu'W is small enough or j > Q do 

3.1) 4- (l-tu)^ j) +tu(/3 U) + X T (y-X/3 {j) )) for j > and «- /3 (j) + X T (y-X/3 (j) ) 

for j = 0; 
Group form: 

3.2a) 4- yJ(g +1) ) 2 + (& 2 > ^<k<D 

3.2b) Let A be the median of the ($N)th larest and (§N + l)th largest elements in {fjjf }. For 

1 < k < d, if e i] ± 0, <- V; a, ,)//r ^ 

<" fflM^; ^ T ))/l < k • Otherwise = ^ = 0. 

Non-Group form: 

3.2') j3^ +1 > 4— 6ifij(^ J+1 '; A, rj), where A z'5 f/ze median of the (§N)th largest component and the 
{f}N + l)th largest component of \/3^ +1 '\; 
end while 

deliver Remaining dimensions after screening: {/ G F : ||/3y +1 ^|| 2 7^ 0} (group version) or {k : 1 < 
k < 2D, (3 { k ]+1) 0} (non -group version). 



has convergence guarantee. Similar to Theorem [Q assume r = 1 > 1 1 J^C 1 1 2 - Let G be the same energy 
function constructed in (fTST) with P given by (TTSl or (fl4l) . For simplicity, suppose m := ^iV G N. 

Theorem 3: For any < to < 1, under the no-tie-occurring assumption 03 in Appendix the sequence 
of iterates (/3^,£^) from Algorithm |2] has the same function value decreasing property (fT9b for the 
energy function G, and satisfies nz(/3^) < m. Any limit point (3° of {/3^} is a ridge estimate 
restricted to X nz (p°\ with \nz(/3°)\ < m, and the sequence G((3 u \£ u \l3 {j \€ {j) ;\) decreases to the 
limit F((3°; A). Furthermore, under the model identifiability assumption £(m, 77) in AppendixO nz(/3^') 
stabilizes in finite steps, i.e., nz{(3^) does not change as j is large enough. 

See Appendix O for its proof. We use SCV to tune 77 or simply set i] = 0. This solves a Z -constrained 
problem. GIST-Screening works decently in super-resolution spectral analysis seen from the experiments: 
even if (l is as small as 20, the true signal is included after screening and the computational cost can be 
reduced to a large extent. 

An interesting observation is that with (3^ = 0, the first iteration step of Algorithm [2] ranks the 
frequencies based on X y. In other words, the correlation between the signal y and each dictionary 
atom is examined separately, to determine the candidate dimensions. Of course, this single frequency 
analysis is merely marginal and does not amount to joint modeling. The crude ranking is not suitable in 
super resolution studies due to the existence of many correlated frequency predictors. Algorithm [2] iterates 
to avoid such greediness. 

We found this screening algorithm to be pretty flexible. Even if sparsity is not desired, it can be applied 
separately with 1? > 1 to yield a meaningful result for super-resolution spectral analysis. 

F. GIST Framework 

We introduce the complete GIST framework to solve the spectral estimation problem. Fig. |4] shows the 
flowchart outline. 

1) Dictionary Construction and Normalization: We construct an overcomplete dictionary through © 
with sufficiently high resolution. Then standardize the data, by (a) centering y and (b) normalizing 
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each predictor column in X to have mean and variance 1 . After the standardization, all predictors 
are equally extended in the predictor space. 

2) GIST Spectrum Screening: This step can greatly reduce the computational complexity. We perform 
an iterative probabilistic screening to remove a number of nuisance frequency components and keep 
$N candidate predictors with •& < 1 (say •§ = 0.8) to achieve supervised dimension reduction. See 
Section IIII-EI for details. 

3) Model Fitting: For each given value of the regularization parameter in a predefined grid, run an 
iterative group-thresholding algorithm developed in Section IIII-BI to obtain a local optimum to 
([9]). All such solutions are collected to form a solution path parameterized by the regularization 
parameters. 

4) Model Selection: An optimal solution (3 opt is selected from the solution path based on a data- 
resampling version of high-dimensional BIC (Section IIII-DI) . 

5) Spectrum Recovery: The signal can be reconstructed from the coefficient estimate. The power 
spectrum density is estimated by PSD(/ fe ) = /3 opt fc + /3 opt D+k , 1 < k < D. 



Signal y(t n ) 

I 



Dictionary Construction and Normalization 

i ~ 



GIST Spectrum Screening 

i 



Model Fitting 



Model Selection 



I 



Spectrum Recovery 



Fig. 4. The flowchart of the GIST framework which shows the five main steps to solve the spectral estimation problem. 



IV. Experiments 

We conducted a series of experiments to show the performance of GIST in sparse spectral estimation. 
Our results support the following claims: 

1) Nonconvex penalties easily outperform the convex l± penalty in both frequency selection and time- 
domain prediction, with no need of evaluating the global minimum or using multiple random starts; 

2) The hybrid hard-ridge (or l + l 2 ) penalty is much better than the hard (or l ) penalty; 

3) Group penalized models are more appropriate than nongrouped models in spectral analysis; 

4) The SCV tuning works well in spectral estimation, even comparable to large-data validation; 

5) GIST-screening leads to efficient computation with little performance loss even for coherent systems 
with large noise. 

We have done extensive simulations. Due to limited space, the results of the typical TwinSine examples 
are presented unless otherwise specified. 
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A. Data Simulation Setup & Performance Measures 
Consider the discrete TwinSine signal given by 

y(t n ) = 2 cos(2tt • 0.25t n + vr/3) + 3 cos(2tt • (0.25 + 5)t n + vr/5) + e(t n ), (23) 

where 5 determines the desired resolution and e(t n ) is white Gaussian noise with variance a 2 . N training 
samples are observed at time t n = n, 1 < n < N. The spectrum frequency dictionary is constructed 
by setting the maximum frequency / max = 0.5 Hz, and the number of frequency bins D = f ma , x /S. The 
control parameters are the sample size N, frequency spacing 5, and noise variance a 2 . The default signal 
uses N = 100, a 2 = 1, and 5 = 0.002. We then vary one parameter at a time — for example, iV = 50, 150, 
5 = 0.002,0.005, or lOlgcr 2 = 0,10 — to study the algorithm performance with respect to sample size, 
noise level, or frequency spacing. For stability, each model is simulated 50 times, and then we report the 
mean or median of the benchmark measures defined as follows. 

As mentioned previously, the goal of spectral estimation is twofold: accurate signal prediction in the time 
domain and consistent spectrum recovery in the frequency domain. To measure the prediction accuracy, 
we generated test data in each simulation with the same control parameters (a 2 , 5) as the training data. 
They are independently observed at N test = 2000 time points different than those of the training data. The 
effective prediction error is given by MSE = YlflTiVi ~ X T& ~ ct) 2 /N tes t — c 2 - We found the histogram 
of MSE is highly asymmetric and far from Gaussian. Therefore, the median (instead of the mean) of 50 
MSEs was reported as the goodness of fit of the obtained model. 

We characterize the spectrum selection consistency by 1) Miss (M) rate - the mean of \{i : Pi ^ 0, pi = 
0}|/|{i : Pi 7^ 0}| in all simulations, where | ■ | is the cardinality of a set, 2) False Alarms (FA) rate - 
the mean of \{i : p { = 0, pi ^ 0}|/|{z : Pi = 0}| in all simulations, and 3) Joint Detection (JD) rate 
- the fraction of {i : ft ^ 0} C {i : ^ 0} among all runs. JD is the most important measure on 
easier problems while M makes the most sense for hard problems. An ideal approach for spectral analysis 
should have small MSE (echoing Ol), low M and FA, and high JD rates (echoing 02). 

B. Experimental Results 

1) Comparison with some existing methods: Before examining the influence of regularization form, 
grouping, SCV and screening, we first make a comparison of the advocated method (group hard-ridge 
GIST) to some existing methods in the literature. 

We implemented BP 0, IAA-APES (or IAA for short) [EU, SPICE (381, LZA-F El, and used the 
SparSpec software rtTTTl . These methods and our group hard-ridge GIST (or GIST for short) were applied 
to the default TwinSine signal; the spectral estimates from 50 simulation runs are plotted in Fig. |5] 

Seen from the dot plots, BP's frequency identification is not consistent: 0.254 Hz is always included 
in the selection. SparSpec may seriously underestimate the frequency coefficients, and failed twice (com- 
pletely). IAA, SPICE, and LZA-F do not have inherently sparse results. One must set an appropriate 
cutoff value to discern the present frequencies, which can be difficult in some simulation runs. In the 
experiments LZA-F has good frequency detection performance. GIST is even better and shows more 
concentrated signal power at the true frequencies. Its coefficient estimates are close to the truth. 

In what follows, we performs extensive computer experiments to study the limitation of the l\ selection, 
the appropriate nonconvex regularization type, the power of atom grouping, SCV tuning, and GIST- 
screening in super-resolution spectral analysis. 

2) Convex vs. nonconvex penalties: To investigate different forms of regularization in various challeng- 
ing setups in super-resolution spectral analysis, we first would like to know if the nonconvex penalties 
can outperform the convex l\ penalty. We compared the l\ relaxation to three nonconvex regularization 
forms: l , l + l 2 , and the grouped l + l 2 . All can be implemented using GIST. 

How to specify the choice of the parameters in each penalty is crucial in making a fair comparison 
between all algorithms. There are various parameter tuning schemes in the literature, sometimes even 
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Fig. 5. 50 spectral estimates using BP, SparSpec, IAA, SPICE, LZA-F, and GIST. 



based on heuristics. Motivated by the fact that an optimal parameter should guarantee good prediction, 
we generated an independent large validation dataset at N val = 2000 time points (different than those of 
the training data and test data). Mean squared validation error can then be used as a criterion to choose 
the optimal value of any parameter. We call this manner of tuning large-data validation which helps one 
to see the ideal power of a given penalty and ensures fair performance comparisons that are less affected 
by various ad-hoc parameter tuning plans. 

First, by varying the noise power, Fig. [6] compares the prediction accuracy of the four aforementioned 
regularization types. The li penalty is not as accurate as the nonconvex alternatives. Even when the SNR 
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is high, li shows no advantage. 

To give a comprehensive comparison in various spectrum estimation settings, we varied the control 
parameters (noise variance a 2 , frequency resolution 5, and samples size N) in the default signal. Table HI 
Table HI] and Table UIT1 show prediction and frequency selection results in terms of MSE, M, FA, and FD. 
Again, nonconvex penalties beat the /i-penalty in MSE in various situations. The limitation of the plain 
l\ regularization is obvious in super-resolution spectral recovery: the miss rates of the l\ penalty are as 
high as 50% and the joint identification rates can be as low as zero. 

To examine why the l\ selection is so poor, we plot the l\ coefficient estimates associated with four 
frequency components 0.248, 0.25, 0.252, 0.254 for all possible values of A in Fig. [7J It shows the 
misidentification results from / x are inherent (not due to parameter tuning). The true coefficients of the four 
components are given by [0, 1, 2.4, 0] T for cosine components, and [0, —1.7, —1.8, 0] T for sine components. 
To locate the true frequencies, one wishes to see for a certain value of A, all A (A) being zero except 
for the ones at two frequencies 0.25, 0.252. Yet the true sparsity pattern never appears in either solution 
path whatever value of A we choose. Even if the correct number of frequency components is known, one 
tends to select the cosine components with frequencies of 0.252, 0.254, and the sine components with 
frequencies of 0.248, 0.25, exactly the same as our large-validation tuning did. We call this phenomenon 
'power shifting' . The bizarre behavior of Z x is brought by the super-resolution challenge. When D is large, 
the sinusoidal components with frequencies close to each other result in high mutual coherence IfToll , 
seen from the Gram-matrix of the dictionary. The /i-relaxation cannot achieve stable sparse recovery for 
coherent designs [fTTl . ffl9l . 
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Fig. 6. Prediction errors (measured by mean-squared errors) of different regularization types. The noise power (lOlgcr 2 ) is varied from 
-10 dB to 10 dB. 

3) h + h vs. l : The previous experiment shows nonconvex penalties behaved significantly better than 
the l\ penalty. We further study which nonconvex penalty is preferred in challenging coherent problems. 
Here we focus on two nonconvex candidates: the Z -penalty as used in GIST-Hard, and the /o+^2 penalty as 
used in GIST-HR. As shown in the M columns of Tables IIIuTH the higher miss rates of l clearly suggest 
that it may yield an over-sparse model. The 'under- selection' caused by l ma Y m i ss true frequency 
components. To give an explanation of this phenomenon, notice that the l penalty offers no shrinkage 
at all for nonzero coefficients and its regularization is through hard-thresholding only. Therefore, the / 
penalization tends to kill too many predictors to achieve the appropriate extent of shrinkage especially 
when the SNR is low. An inappropriate nonconvex penalty may seriously mask true signal components. 

A natural improvement of the plain / would involve a second penalty term, of l 2 (ridge/Tikhonov) type. 
The / 2 portion helps address the coherence of the design and compensates for noise and collinearity. See 
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TABLE I 

Mean squared errors (MSE), miss rates (M), false alarm rates (FA), and joint detection rates (JD) of different 

REGULARIZATIONS, WITH VARYING VALUES OF a 2 AT 1, VTO, AND 10. 
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TABLE II 

Mean squared errors (MSE), miss rates (M), false alarm rates (FA), and joint detection rates (JD) of different 

REGULARIZATIONS, WITH VARYING VALUES OF 5 AT 0.002, 0.003, AND 0.005. 
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(fT3l) for this hybrid nonconvex penalty. The resulting sparsity pattern is better than that of the / penalty, 
evidenced by the smaller MSE and larger JD numbers. In a word, the / + /2-type penalty is more desired 
in super-resolution spectrum recovery and enjoys the sparsity and accuracy from / and I2, respectively. 

4) Group models vs. nongroup models: GIST allows for grouped versions of the algorithms. We found 
that taking advantage of the pairing structure of sine and cosine frequency atoms brings impressive 
improvement for any penalty in spectral analysis. Table HVl shows such performance gains for the convex 
li and the nonconvex / + h penalties, respectively. In comparison with the systematic experiments in 
Table U Table HH and Table Hill the grouped form of 1$ + 12 penalty is clearly the choice of regularization 
in both MSE and frequency selection. 

5) Selective cross validation tuning: After ensuring the appropriate form of regularization (group I0+I2), 
we turn to the tricky parameter tuning problem. We used the large-data validation for parameter tuning in 
previous comparisons, to appreciate the true potential of each algorithm in the ideal situation. This ideal 
validation tuning provides a good reference but is not practical for small- N datasets. Fortunately, we have 
the data-resampling based SCV to do the job decently. Table [V] compares SCV tuning with large-data 

TABLE III 

Mean squared errors (MSE), miss rates (M), false alarm rates (FA), and joint detection rates (JD) of different 

REGULARIZATIONS, WITH VARYING VALUES OF N AT 50, 100, AND 150. 
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Fig. 7. The Zi-penalized estimates $i(X) vs. scaled \\(3\\i (the dual parameter of A). For a better view, only 4 frequencies are 
shown. The green dashed and blue solid lines correspond to the true frequencies 0.25 and 0.252 respectively. 

TABLE IV 

Power of grouping for both convex and nonvex penalties. 
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validation at different noise levels for the group l + 1 2 penalty. In general, we observe that SCV, though 
only using the training data, has comparable performance to large-data validation in parameter tuning. 
Also, SCV only runs the sparsity algorithm once and does not increase much computation cost in the 
tuning stage. We recommend SCV for parameter tuning, especially when a nonconvex penalty is applied 
to a small sample problem. 

table v 

Performance of selective cross validation tuning in comparison to large-data validation. 
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6) Probabilistic spectral screening: We examine the performance of the GIST- screening algorithm (cf. 
Algorithm [2]) in this experiment. The candidate ratio i3 determines the dimension ($iV) of the reduced 
predictor space. Therefore, the lower the value of %9, the more efficient the computation, but also the 
higher the risk of mistakenly removing some true components. Our screening technique turns out to be 
pretty successful: even if we choose $N to be as small as 10 and set a very large noise variance a 2 = 10, 
it never misses any true frequency component and does not deteriorate MSE. Fig. [8] plots the location 
of 200, 100, and 50 remaining atoms, respectively. The selected frequencies are non-uniform; the density 
near the true spectra is higher. 

Next, we make a more challenging problem by modifying the signal to have large noise variance a 2 = 10 
and 10 present frequency components at 0.24, 0.242, . . . , 0.282. Fig. [9] shows the total computation time 
and the final detection results. The plotted time is the total running time of both GIST screening and 
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Fig. 8. Locations of the remaining frequency atoms after GIST screening with •ffN = 200, 100, 50. The true frequencies are indicated by 
red lines and stars. 
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Fig. 9. Computation time and detection performance of GIST screening on a hard problem with 10 present frequency components and 
noise power 10 lg a 2 = 10 (or a 2 = 10). The a;-axes represent -&N. 



model fitting and selection, averaged over 50 runs. The empirical experience is that GIST- screening is 
safe when $N is roughly 5 times greater than the number of truly relevant atoms. Under the sparse 
spectrum assumption, our original choice with ■§ = 0.8 seems to be good in practice, which reduces the 
computation complexity significantly with little performance lost. 

7) Computational cost: With GIST used in implementation, computing the solution path associated 
with a nonconvex penalty required 4 to 6 times as much time as computing the l x solution path. The 
screening technique can safely reduce half of the total computation time. In all, our algorithm for a given 
nonconvex penalty (say the hard-ridge penalty in the group form) ended within 5 minutes on a 2.66 
GHz CPU. In comparison with the popular Zi penalty, we think this is an acceptable tradeoff between 
performance and computational complexity in super-resolution spectral analysis. 
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V. Conclusions 

We have presented a sparsity-based group iterative spectrum thresholding (GIST) framework to meet 
the super-resolution challenge in spectral estimation. It is able to handle nonconvex penalties and take 
the pairing structure of sine and cosine atoms into account in regularizing the model. The hard-ridge 
penalty was shown to have some advantages over the popular convex l t penalty as well as the l penalty. 
The associated iterative probabilistic spectrum screening can be used for fast computation. In parameter 
tuning, the selected cross-validation criterion overcomes the training inconsistency issue of the plain CV 
and is much more computationally efficient. Like other sparse methods, GIST can be applied to unevenly 
sampled signals. Some future research topics include the extension of GIST to non-Gaussian models and 
multivariate responses. 
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Appendix A 
Proof of Theorem Q] 

We show the result for the group form only. The proof for the non-group form is similar and simpler. 
The following continuity assumption is made throughout the proof: 

Assumption 21: is continuous at any point in the closure of {$}^}- 
(For continuous thresholding rules such as soft-thresholding, this regularity condition always holds. Prac- 
tically used thresholding rules (such as hard-thresholding) have few discontinuity points and such discon- 
tinuities rarely occur in any real application.) For to = 1, see 11291 for the proof details. In the following, 
we assume < u < 1. 

Note that G is quadratic and convex in (3, £, and £, but possibly nonconvex and nonsmooth in 7. 

Lemma A.l: Given an arbitrary thresholding rule 0, let P be any function satisfying P{6\ A) — P(0; A) = 
P e (9; X) + q(9',X) where P Q (9; X) = J ( | 6 ''(sup{s : 0(s; A) < u} — u) du, q{6; X) is nonnegative and 
q(Q(t; A)) = for all t. Then, the minimization problem 

nj|ni||y-/3||2 + P(||/3|| a ;A) 

has a unique optimal solution given by /3 = Q(y; X) for every y provided that ©(•; A) is continuous at 

Nib- 
See (291 for its proof. 

Given (3 and £, the problem of minimizing G over (7, £) is simplified to (detail omitted) 

min -\\~f-u(I-Y,)l3-ujX T y- (1 - u)£\\ 2 2 + P{j] X), and (24) 
7 2 

mm ^ — [C - ui(I - E)/3 - coX T y - (1 - tu)£] T (I - E^fC - u,(I - E)/3 - coX T y - (1 - u)$. 

(25) 

Based on Lemma IA.ll the optimal solutions are j opt = Q(u(I — E)/3 + uX T y + (1 — A) and 
Copt = U (I ~ E)/3 + toX T y + (1 — Therefore, we obtain 

G(/3^ +1 \^ j+1 \f3^\^;X) < G(f3 ij \^ j \f3 u \^ ] ;X) 

_ £0'))T(j _ S )-l(£tf+l) _ ^0')). (26) 
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On the other hand, given 7 and £, G can be expressed as a quadratic form in (3 that is positive definite. 
The same fact holds for It can be computed that 

\Gp = w (I-S)09-7) + (l-w)(€-C) 
VG C = ^(1 - S)- l [w(I - S)(/3 - 7) + (1 - - C)], 

from which it follows that G can be written as 

- E)(/3 - 7) + (1 - - OfwV " S)- 1 !"^ - - 7) + (1 - w)(€ " C)] 

in addition to the terms involving only 7 and £. Hence f3 opt = (3 and £ opt = £ (though not unique) 
achieve the minimum. We obtain 

G(fiV +1 \ /3 (i+1) , A) < G(f3^ j+1 \ A) 

_ S)(/3 0) _ pCHD) + (i _ _ ^+D)]r (I _ s) -i . 

- E)(/3 W - /3 (J+1) ) + (1 - - (27) 

Summarizing ([26]) and ([27]) yields (TT9l) . 

Assume a subsequence — >• (3° as / — >■ 00. Since 

G((3 [jl \i {jl \(3 {jl \ - G{^ jl+1 \^ l+1 \ (3^ +l \$, (jl+l) ) 
<G{(3 {n \i {n \(3 {jl \$} n) ) - G(/3 {jl+1 \£ {jl+1 \ (3 {jl+1 \^ l+l) ) -+ 0, 

^Cj»)_^(ji+i) and thus (f3 Ul) -f3 Ul+1) ) ->■ 0. That is, (l-w)^ (ji) +w(/3 (j;) +X T ( ? /-X/3 (j!) ))-^ (ji) -> 
and 9(£ (ii) ; A) - -»■ 0. From X T (y - Xf3 Ul) ) - -+ and the continuity assumption, (3° is a 
group 9-estimate satisfying (QJ), and linx^ G(/3 {jl \ (3 Ul) , = F(/3°). 

Appendix B 
Proof of Theorem [2] 

Recall that T denotes the frequency set covered by the dictionary X and we assume all column norms 
of X are yN (or the diagonal entries of £ = X T X are equal to N). 

Applying Theorem [H we can characterize any group l x estimate (3 from Algorithm Q] by 

h = B0 + X T y/T 2 - S^/r 2 ; A) (28) 

with being the soft-thresholding function. Let s = s((3) denote a function of (3 satisfying 

||s/|| 2 <l,V/E*G9) and s f = (3 f /\\P f \\ 2 , V/ e nz(J3), (29) 

and s((3f) : = [s((3)]f. We have ||s(/3j)|| 2 < 1, V/ G T . (In the group Zi case, s is a subgradient of 
E/ e ^ 11/3/ II 2.) Then ([28]) reduces to /3 + As(/3, A) = + X T y/r 2 - £/3/r 2 or 

VP = X T y-\T*s{P\ (30) 



for some s satisfying (1291 . 

Lemma B.l: Assume S n ^* is nonsingular. Then (|30l) is equivalent to 




X^e + Ar 2 S z *, n ^S- 2 U(^) - Ar 2 s(^) 
^ + S-^X^.e - Ar 2 S (^)) - 



where S z * :— E 2 * — S 2 * in2 *S n2 1 »S n2 * )Z », and Xj* :— X T Z * — Y, z *^ nz *H n l,X^ lz , 
This is Lemma A.l in [|26l. 
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Lemma B.2: Suppose 



Z\ 

Z2 



N 








V , where V is a correlation matrix. Then for any M, 



P{z\ + z 2 > M 2 ) < P(£ > AP/2) with £ ~ x 2 (2). 

From || V || 2 < ||^||f < 2, 21 — V is positive semi-definite. Let z[, z' 2 be independent standard 
Gaussian random variables. We get P(z 2 + z\ > M 2 ) < P((^ , lV / 2) 2 + (z' 2 V2) 2 > M 2 ) = P(f > M 2 /2) 
from Anderson's inequality 11391 . 

Lemma B.3: Suppose f ~ x 2 (2). Then for any M, P(f > 2M 2 ) < M 2 e~ ( - M2 ~ 1 \ 

See, e.g., [ 401 for a proof of the standard x 2 ta il bound. 



Let X? = X T f - £ 

.4 



S^»X^», Wf ez*. From Lemma EH we have Pi > P(A n V), with 



X'Je + Ar 2 S /jnz *S^s(^) < Ar 2 , Vs satisfying V/ G ^ 



V := {Hp-^X^e]/^ + Ar 2 || [£^ S (^)]/|| 2 < ||/3/|| 2 ,Va satisfying <^,V/ G n/} . 

Therefore, 1 - Pi < P(A C U V c ) < P(A C ) + P{V C ). 

From the definition of k, ||E /jn;5 .£~**s(/3 nz *)|| 2 < ^VP^II^'Ib^V^^T = Kp nz */n, Vf e z*. It 
follows that 



P(A C ) < p 2 .P (||e}|| 2 > (1 - K Pnz ,/fi)\r 2 /(aVN) 



(32) 



where e' = X%e/(<rVN) ~ iV(0, S z */iV) because Xj*X' z * = S z *. Define M := {l-K Pnz * / /i)Ar Q 2 /(cy/N). 
Based on Lemma [BT2l and Lemma IB31 and the fact that the diagonal entries of S z * = £ 2 * — H z * )nz *'£~ z *'E nz * iZ * 
are all less than or equal to N, we obtain a bound for (1321) : 

P{A C ) < ^p z *M 2 e~ M2/ \ (33) 

Next we bound P(V C ). Suppose the spectral decomposition of £ nz » is given by UDU T with the ith 
row of U given by uj, then we can represent S^ 1 , as [uf .D -1 itjl , and thus diag(S^) < l/(N/j). 
Moreover, from ||S^s|| 2 < y/p^/(N/i), ||[S^.s]/|| 2 < y/p^/(Nfi), V/ G nz*. Introduce e" = 
y/jlN? l -^Xl zt e/((7) ~ iy(0,^JVE^). From the last two lemmas, 



P(n < Pn^P(||e';|| 2 > (pin \\f3}\\ 2 - Xrl^/^N))^^) < ^LV^ 4 



fGnz* 



a 



where L := (min /en2 , ||/3}|| 2 - Xr 2 ^p^/(^N))^. 

The proof of (l22l) for the hard-ridge thresholding follows similar lines. First, define s((3; \,r]) as 



\s f \\ 2 <l,Vf ez((3) and Sf = ^f3 f ,Vf E nz((3). 



Then similar to (1301) we have 



E/3 = X T 2/ -Ar 2 s(/3;A,r ? ), 



(34) 



(35) 



Let X' Z T := Xf. - S^^S- 1 ^/ - r](E nz * + rjI^X^. To bound P 02 , from Lemma ED and ([34]), 



we write (1351) as 

Xr 2 s0 zf] A, 77) + S*P Z . = X"Je + ^E^*^ + r^I)" 1 (3* nz , 
P nz , = (S n2 , + nr 2 !)- 1 ^^ + (S n ^ + V r 2 iy l x T nz ^ 



where we used £ n2 ,(£ n2 » + t/TqI) x H nz * = (£ n2 » + rjr^ I) x . We obtain 

P02 > P(3s satisfying dH s.t. = and || 2 > A/(l + rj),Vf G nz*) > P(A n V) 
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with A := {\\X" f T e + V T^ f , n A^nz* + mllY^lAh < Ar 2 ,V/ G z*} and V := {||[(£ n2 * + 
^1)^^(31^]} + [(Sn,* +r 1 rUr l X T n Mfh > i^.V/ G rw*}. For e' = X"Je/ (aVl?) and 
e" = fl! ^J^° (S n2 * + ^Tq J) _1 X^ 2 ,»e/cr, their covariance matrices can be computed-(£ 2 * — £ z * jnz *[I — 

r/ 2 (S n2 , + r / /)- 2 ]S^Sj, jnz ,)/iV and ^^(S^ + f ,r 2 /)- 1 S n ,.(S nz ,+ I| r 2 /)- 1 , respectively. Fur- 
thermore, it is not difficult to see that their diagonal entries are bounded by 1 (under r) < ^N/tq). 
Therefore, 

I-P02 < P(||e;|| 2 >^=(Ar 2 - m } 2 kv^w||/3^|| 2 ), for some/ e z*) 

+P(||e';|| 2 > (l - T^)^fi for some / G nz*) 
< - A p z *M' 2 e- M ' 2 l* + - A p z *L' 2 e- L ' 2 l\ 

^0 .- II fl* IU ar,H 77 — r, _ A \MjV+^ 



where M' := ^(Ar 2 - -A^v&fc || 2 ) and 1/ := (i 



1+7? ) yfjlNcr 



Appendix C 
Proof of Theorem [3] 

First, we introduce a group quantile thresholding rule 0#(-;m, rj) as a variant of the hard-ridge 
thresholding. Given 1 < m < \F\ and i] > 0, Q*(-;m,r]) : a G M? D — >■ b G R 2D is defined as 
follows: by = <2//(l + r/) if ||a/|| 2 is among the m largest norms in the set of {||a/||2 : / G J 7 }, and 
bf = otherwise. In the case of ties, a random tie breaking rule is used. For simplicity, we always make 
the following no tie occurring assumption: 

Assumption 53: No ties occur in 0*(£; m, rj) for any £ in the closure of {£ }. 
From the algorithm, nz(f3 ( - j ') < m is obvious. To prove the function value decreasing property, we 
introduce the following lemma. 

Lemma C.l: f3 = # (£; m, r]) is a globally optimal solution to 

™in\U - + |||/3|| 2 =: / (/3; rj) s.t. nz(/3) < m. (36) 

Let I C J 7 with |/| = m. Assuming /3 /c = 0, we get the optimal solution /3 with (3 I = + if). It 

follows that f((3;i]) = 2 (i+ v ) ^2fei ll^/ili- Hence the group quantile thresholding Q#(£;m,r]) yields a 
global minimizer. 

Based on this lemma, (fT9l) can be proved following the lines of Appendix [A] Moreover, for /3° = 
lim^oo j3^ l \ we obtain 

/3 ° = e#( / 3° + X T (y-X^);m,r / ), 

due to the no-tie-occurring assumption. 

Let nz° = nz(/3°). By the definition of # , /3° nz o = (3° nz0 /(l + rj) + X^ z o{y - X (3°) / {1 + rj) , and thus 
r]{3° nz o + X T nz o{y — X„ 2 o/3° 20 ) = 0. But this is the KKT equation of the convex optimization problem 

• II V Il2 1 V n ||2 /-O-TN 

mm- 2/-X77 2 + - 7 2 (37) 
7 ^ ^ 

with / = nz° given. Therefore, (3° is a ridge regression estimate restricted to X nz o. 

To show the stability result, we make an additional model identifiability assumption. Let /3j(r/) denote 
the ridge regression estimate restricted to Xj (which minimizes (|37l) ). 

Assumption <£(m, rj). Given rj > and m (1 < in < \J-\), there exist no 1,1' C T such that F(/3j(r]); rj) = 
F(Pj,(t]);t]) with \I\ < m, |/'| < m. 
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Similar to Appendix lAl we know that G{(3^\ (3^\ rj) has a unique limit given by F((3°;rf) 
for any limit point (3°. Under Assumption £(771,77), (3° must be unique, i.e., lim^oo/3^ = (3°. Suppose 
\nz°\ = m. Then as j is sufficiently large, is arbitrarily close to (3°, and so nz(/3^') = nz°. 

Now assume \nz°\ < m. Because f3^> is a finite-dimensional vector having m nonzero blocks, there 
must exist an index set nz': T D nz' D nz° such that \nz'\ = m, nz{f3^ 1 ') = nz' for a subsequence 
pih) of £0 L et £° = [3° + X T (y - X(3°). From /3° = e # (£°;m,77), \nz°\ < m and /3° nz o )c = 0, the 
(\nz° \ + l)th largest £j must be exactly zero. Therefore, for any / e nz' \ X^(y — X nz i(3° nz ,) = 0. 
We then obtain r]f3° nz , + X T nz ,{y — X nz i(3° nz ,) = 0, which means (3° nz , is a ridge estimate restricted to X nz / 
due to the convexity of (T37T ). But this contradicts Assumption (£(771,77), and so < m cannot occur. 
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